(54) Method and apparatus for identifying splittable packets in a multithreaded VLIW processor



(57) A method and apparatus are disclosed for allo-
cating functional units in a multithreaded very large in-
struction word (VLIW) processor. The present invention
combines the techniques of conventional very long in-
struction word (VLIW) architectures and conventional
multithreaded architectures to reduce execution time
within an individual program, as well as across a work-
load. The present invention utilizes instruction packet
splitting to recover some efficiency lost with convention-
al multithreaded architectures. Instruction packet split-
ting allows an instruction bundle to be partially issued in
one cycle, with the remainder of the bundle issued dur-
ing a subsequent cycle. There are times, however, when
instruction packets cannot be split without violating the
semantics of the instruction packet assembled by the
compiler. A packet split identification bit is disclosed that
allows hardware to efficiently determine when it is per-
missible to split an instruction packet. The split bit in-
forms the hardware when splitting is prohibited. The al-
location hardware assigns as many instructions from
each packet as will fit on the available functional units,
rather than allocating all instructions in an instruction
packet at one time, provided the split bit has not been
set. Those instructions that cannot be allocated to a
functional unit are retained in a ready-to-run register.
On subsequent cycles, instruction packets in which all
instructions have been issued to functional units are up-
dated from their thread's instruction stream, while in-
struction packets with instructions that have been held
are retained. The functional unit allocation logic can then
assign instructions from the newly-loaded instruction
packets as well as instructions that were not issued from
the retained instruction packets.
Description

Cross-Reference to Related Applications

[0001] The present invention is related to United
States Patent Application entitled "Method and Appara-
tus for Allocating Functional Units in a Multithreaded
Very Large Instruction Word (VLIW) Processor," Attor-
ney Docket Number (Berenbaum 7-2-3-3); United
States Patent Application entitled "Method and Appara-
tus for Releasing Functional Units in a Multithreaded
Very Large Instruction Word (VLIW) Processor," Attor-
ney Docket Number (Berenbaum 8-3-4-4); and United
States Patent Application entitled "Method and Appara-
tus for Splitting Packets in a Multithreaded Very Large
Instruction Word (VLIW) Processor," Attorney Docket
Number (Berenbaum 9-4-5-5), each filed contempora-
neously herewith, assigned to the assignee of the
present invention and incorporated by reference herein.

Field of the Invention 

[0002] The present invention relates generally to mul-
tithreaded processors, and, more particularly, to a meth-
od and apparatus for splitting packets in such multi-
threaded processors.

Background of the Invention 

[0003] Computer architecture designs attempt to
complete workloads more quickly. A number of architec-
ture designs have been proposed or suggested for ex-
ploiting program parallelism. Generally, an architecture
that can issue more than one operation at a time is ca-
pable of executing a program faster than an architecture
that can only issue one operation at a time. Most recent
advances in computer architecture have been directed
towards methods of issuing more than one operation at
a time and thereby speeding up the operation of programs.
FIG. 1 illustrates a conventional microprocessor archi-
tecture 100. Specifically, the microprocessor 100 in-
cludes a program counter (PC) 110, a register set 120
and a number of functional units (FUs) 130-N. The re-
dundant functional units (FUs) 130-1 through 130-N pro-
vide the illustrative microprocessor architecture 100
with sufficient hardware resources to perform a corre-
sponding number of operations in parallel.
[0004] An architecture that exploits parallelism in a
program issues operands to more than one functional
unit at a time to speed up the program execution. A
number of architectures have been proposed or sug-
gested with a parallel architecture, including supersca-
lar processors, very long instruction word (VLIW) proc-
essors and multithreaded processors, each discussed
below in conjunction with FIGS. 2, 4 and 5, respectively.
Generally, a superscalar processor utilizes hardware at
run-time to dynamically determine if a number of oper-
ations from a single instruction stream are independent,
and if so, the processor executes the instructions using
parallel arithmetic and logic units (ALUs). Two instruc-
tions are said to be independent if none of the source
operands are dependent on the destination operands of
any instruction that precedes them. A very long instruc-
tion word (VLIW) processor evaluates the instructions
during compilation and groups the operations appropri-
ately, for parallel execution, based on dependency in-
formation. A multithreaded processor, on the other
hand, executes more than one instruction stream in par-
allel, rather than attempting to exploit parallelism within
a single instruction stream.

[0005] A superscalar processor architecture 200,
shown in FIG. 2, has a number of functional units that
operate independently, in the event each is provided
with valid data. For example, as shown in FIG. 2, the
superscalar processor 200 has three functional units
embodied as arithmetic and logic units (ALUs) 230-N,
each of which can compute a result at the same time.
The superscalar processor 200 includes a front-end
section 208 having an instruction fetch block 210, an in-
struction decode block 215, and an instruction sequenc-
ing unit 220 (issue block). The instruction fetch block
210 obtains instructions from an input queue 205 of a
single threaded instruction stream. The instruction se-
quencing unit 220 identifies independent instructions
that can be executed simultaneously in the available
arithmetic and logic units (ALUs) 230-N, in a known
manner. The refine block 250 allows the instructions to
complete, and also provides buffering and reordering for
writing results back to the register set 240.
[0006] In the program fragment 310 shown in FIG. 3,
instructions in locations L1, L2 and L3 are independent,
in that none of the source operands in instructions L2
and L3 are dependent on the destination operands of
any instruction that precedes them. When the program
counter (PC) is set to location L1, the instruction se-
quencing unit 220 will look ahead in the instruction
stream and detect that the instructions at L2 and L3 are
independent, and thus all three can be issued simulta-
neously to the three available functional units 230-N. For
a more detailed discussion of superscalar processors,
see, for example, James E. Smith and Gurindar S. So-
hi, "The Microarchitecture of Superscalar Processors,"
Proc. of the IEEE (Dec. 1995), incorporated by refer-
ence herein.
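
The independence test described above can be sketched in a few lines. This is a minimal illustration only, not the sequencing logic of any particular processor: the `Instr` record, the `independent` function and the register names are hypothetical, and the check covers only the condition stated in the text (no source operand depends on the destination of a preceding instruction).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str    # destination register (hypothetical naming)
    srcs: tuple  # source registers


def independent(instrs):
    """True if no instruction reads a register written by an earlier one."""
    written = set()
    for ins in instrs:
        if any(src in written for src in ins.srcs):
            return False
        written.add(ins.dest)
    return True


# Three instructions with disjoint registers: all can issue together.
frag_ok = [
    Instr("R1", ("R2", "R3")),
    Instr("R4", ("R5", "R6")),
    Instr("R7", ("R8", "R9")),
]
# The second instruction reads R1, which the first one writes.
frag_dep = [
    Instr("R1", ("R2", "R3")),
    Instr("R4", ("R1", "R6")),
]
```

A superscalar processor would perform a check of this kind in hardware at run time, whereas a VLIW compiler would perform it once at compile time.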

[0007] As previously indicated, a very long instruction
word (VLIW) processor 400, shown in FIG. 4, relies on
software to detect data parallelism at compile time from
a single instruction stream, rather than using hardware
to dynamically detect parallelism at run time. A VLIW
compiler, when presented with the source code that was
used to generate the code fragment 310 in FIG. 3, would
detect the instruction independence and construct a sin-
gle, very long instruction comprised of all three opera-
tions. At run time, the issue logic of the processor 400
would issue this wide instruction in one cycle, directing
data to all available functional units 430-N. As shown in
FIG. 4, the very long instruction word (VLIW) processor
400 includes an integrated fetch/decode block 420 that
obtains the previously grouped instructions 410 from
memory. For a more detailed discussion of very long in-
struction word (VLIW) processors, see, for example,
Burton J. Smith, "Architecture and Applications of the
HEP Multiprocessor Computer System," SPIE Real
Time Signal Processing IV, 241-248 (1981), incorporated
by reference herein.

[0008] One variety of VLIW processors, for example,
represented by the Multiflow architecture, discussed in
Robert P. Colwell et al., "A VLIW Architecture for a Trace
Scheduling Compiler," IEEE Transactions on Comput-
ers (August 1988), uses a fixed-width instruction, in
which predefined fields direct data to all functional units
430-N at once. When all operations specified in the wide
instruction are completed, the processor issues a new,
multi-operation instruction. Some more recent VLIW
processors, such as the C6x processor commercially
available from Texas Instruments, of Dallas, TX and the
EPIC IA-64 processor commercially available from Intel
Corp., of Santa Clara, CA, instead use a variable-length
instruction packet, which contains one or more opera-
tions bundled together.

[0009] A multithreaded processor 500, shown in FIG.
5, gains performance improvements by executing more
than one instruction stream in parallel, rather than at-
tempting to exploit parallelism within a single instruction
stream. The multithreaded processor 500 shown in FIG.
5 includes a program counter 510-N, a register set
520-N and a functional unit 530-N, each dedicated to a
corresponding instruction stream N. Alternate imple-
mentations of the multithreaded processor 500 have uti-
lized a single functional unit 530, with several register
sets 520-N and program counters 510-N. Such alternate
multithreaded processors 500 are designed in such a
way that the processor 500 can switch instruction issue
from one program counter/register set 510-N/520-N to
another program counter/register set 510-N/520-N in
one or two cycles. A long latency instruction, such as a
LOAD instruction, can thus be overlapped with shorter
operations from other instruction streams. The TERA
MTA architecture, commercially available from Tera
Computer Company, of Seattle, WA, is an example of
this type.

[0010] An extension of the multithreaded architecture
500, referred to as Simultaneous Multithreading, com-
bines the superscalar architecture, discussed above in
conjunction with FIG. 2, with the multithreaded designs,
discussed above in conjunction with FIG. 5. For a de-
tailed discussion of Simultaneous Multithreading tech-
niques, see, for example, Dean Tullsen et al., "Simulta-
neous Multithreading: Maximizing On-Chip Parallelism,"
Proc. of the 22nd Annual Int'l Symposium on Computer
Architecture, 392-403 (Santa Margherita Ligure, Italy,
June 1995), incorporated by reference herein. General-
ly, in a Simultaneous Multithreading architecture, there
is a pool of functional units, any number of which may
be dynamically assigned to an instruction which can is-
sue from any one of a number of program counter/reg-
ister set structures. By sharing the functional units
among a number of program threads, the Simultaneous
Multithreading architecture can make more efficient use
of hardware than that shown in FIG. 5.
[0011] While the combined approach of the Simulta-
neous Multithreading architecture provides improved ef-
ficiency over the individual approaches of the supersca-
lar architecture or the multithreaded architecture, Simul-
taneous Multithreaded architectures still require elabo-
rate issue logic to dynamically examine instruction
streams in order to detect potential parallelism. In addi-
tion, when an operation takes multiple cycles, the in-
struction issue logic may stall, because there is no other
source of operations available. Conventional multi-
threaded processors issue instructions from a set of
instructions simultaneously, with the functional units de-
signed to accommodate the widest potential issue. A
need therefore exists for a multithreaded processor ar-
chitecture that does not require a dynamic determina-
tion of whether or not two instruction streams are inde-
pendent. A further need exists for a multithreaded archi-
tecture that provides simultaneous multithreading. Yet
another need exists for a method and apparatus that im-
prove the utilization of processor resources for each cy-
cle.

Summary of the Invention 

[0012] Generally, a method and apparatus are dis-
closed for allocating functional units in a multithreaded
very large instruction word (VLIW) processor. The
present invention combines the techniques of conven-
tional very long instruction word (VLIW) architectures
and conventional multithreaded architectures. The com-
bined architecture of the present invention reduces ex-
ecution time within an individual program, as well as
across a workload. The present invention utilizes in-
struction packet splitting to recover some efficiency lost
with conventional multithreaded architectures. Instruc-
tion packet splitting allows an instruction bundle to be
partially issued in one cycle, with the remainder of the
bundle issued during a subsequent cycle. Thus, the
present invention provides greater utilization of hard-
ware resources (such as the functional units) and a low-
er elapsed time across a workload comprising multiple
threads.

[0013] There are times when instruction packets can-
not be split without violating the semantics of the instruc-
tion packet assembled by the compiler. In particular, the
input value of a register is assumed to be the same for
instructions in a packet, even if the register is modified
by one of the instructions in the packet. If the packet is
split, and a source register for one of the instructions in
the second part of the packet is modified by one of the
instructions in the first part of the packet, then the com-
piler semantics will be violated.
[0014] Thus, the present invention utilizes a packet
split identification bit that allows hardware to efficiently
determine when it is permissible to split an instruction
packet. Instruction packet splitting increases throughput
across all instruction threads, and reduces the number
of cycles that the functional units rest idle. The allocation
hardware of the present invention assigns as many in-
structions from each packet as will fit on the available
functional units, rather than allocating all instructions in
an instruction packet at one time, provided the packet
split identification bit has not been set. Those instruc-
tions that cannot be allocated to a functional unit are
retained in a ready-to-run register. On subsequent cy-
cles, instruction packets in which all instructions have
been issued to functional units are updated from their
thread's instruction stream, while instruction packets
with instructions that have been held are retained. The
functional unit allocation logic can then assign instruc-
tions from the newly-loaded instruction packets as well
as instructions that were not issued from the retained
instruction packets.

[0015] The present invention utilizes a compiler to de-
tect parallelism in a multithreaded processor architec-
ture. Thus, a multithreaded VLIW architecture is dis-
closed that exploits program parallelism by issuing mul-
tiple instructions, in a similar manner to single threaded
VLIW processors, from a single program sequencer,
and also supporting multiple program sequencers, as in
simultaneous multithreading but with reduced complex-
ity in the issue logic, since a dynamic determination is
not required.

[0016] A more complete understanding of the present
invention, as well as further features and advantages of 
the present invention, will be obtained by reference to 
the following detailed description and drawings. 

Brief Description of the Drawings 

[0017] 

FIG. 1 illustrates a conventional generalized micro-
processor architecture;

FIG. 2 is a schematic block diagram of a conven-
tional superscalar processor architecture;

FIG. 3 is a program fragment illustrating the inde-
pendence of operations;

FIG. 4 is a schematic block diagram of a conven-
tional very long instruction word (VLIW) processor
architecture;

FIG. 5 is a schematic block diagram of a conven-
tional multithreaded processor;

FIG. 6 illustrates a multithreaded VLIW processor
in accordance with the present invention;

FIG. 7A illustrates a conventional pipeline for a mul-
tithreaded processor;

FIG. 7B illustrates a pipeline for a multithreaded
processor in accordance with the present invention;

FIG. 8 is a schematic block diagram of an imple-
mentation of the allocate stage of FIG. 7B;

FIG. 9 illustrates the execution of three threads TA-
TC for a conventional multithreaded implementa-
tion, where threads B and C have higher priority
than thread A;

FIGS. 10A and 10B illustrate the operation of in-
struction packet splitting in accordance with the
present invention;

FIG. 11 is a program fragment that may not be split
in accordance with the present invention;

FIG. 12 is a program fragment that may be split in
accordance with the present invention;

FIG. 13 illustrates a packet corresponding to the
program fragment of FIG. 12 where the instruction
splitting bit has been set;

FIG. 14 is a program fragment that may not be split
in accordance with the present invention; and

FIG. 15 illustrates a packet corresponding to the
program fragment of FIG. 14 where the instruction
splitting bit has not been set.

Detailed Description 

[0018] FIG. 6 illustrates a Multithreaded VLIW proc-
essor 600 in accordance with the present invention. As
shown in FIG. 6, there are three instruction threads,
namely, thread A (TA), thread B (TB) and thread C (TC),
each operating at instruction number n. In addition, the
illustrative Multithreaded VLIW processor 600 includes
nine functional units 620-1 through 620-9, which can be
allocated independently to any thread TA-TC. Since the
number of instructions across the illustrative three
threads TA-TC is nine and the illustrative number of
available functional units 620 is also nine, all three
threads TA-TC can issue their instruction packets in
one cycle and move on to instruction n+1 on the
subsequent cycle.
[0019] It is noted that there is generally a one-to-one
correspondence between instructions and the operation
specified thereby. Thus, such terms are used inter-
changeably herein. It is further noted that in the situation
where an instruction specifies multiple operations, it is
assumed that the multithreaded VLIW processor 600 in-
cludes one or more multiple-operation functional units
620 to execute the instruction specifying multiple oper-
ations. An example of an architecture where instructions
specifying multiple operations may be processed is a
complex instruction set computer (CISC).
[0020] The present invention allocates instructions to 
functional units to issue multiple VLIW instructions to 
multiple functional units in the same cycle. The alloca- 
tion mechanism of the present invention occupies a 
pipeline stage just before arguments are dispatched to 
functional units. Thus, FIG. 7A illustrates a conventional 
pipeline 700 comprised of a fetch stage 710, where a 
packet is obtained from memory, a decode stage 720, 
where the required functional units and registers are 
identified for the fetched instructions, and an execute 
stage 730, where the specified operations are per-
formed and the results are processed.
[0021] Thus, in a conventional VLIW architecture, a
packet containing up to K instructions is fetched each
cycle (in the fetch stage 710). Up to K instructions are
decoded in the decode stage 720 and sent to (up to) K
functional units (FUs). The registers corresponding to
the instructions are read, the functional units operate on
them and the results are written back to registers in the
execute stage 730. It is assumed that up to three regis-
ters can be read and up to one register can be written
per functional unit.

[0022] FIG. 7B illustrates a pipeline 750 in accord-
ance with the present invention, where an allocate stage
780, discussed further below in conjunction with FIG. 8,
is added for the implementation of multithreaded VLIW
processors. Generally, the allocate stage 780 deter-
mines how to group the operations together to maximize
efficiency. Thus, the pipeline 750 includes a fetch stage
760, where up to N packets are obtained from memory,
a decode stage 770, where the functional units and reg-
isters are identified for the fetched instructions (up to
N*K instructions), an allocate stage 780, where the ap-
propriate instructions are selected and assigned to the
FUs, and an execute stage 790, where the specified op-
erations are performed and the results are processed.
[0023] In the multithreaded VLIW processor 600 of
the present invention, up to N threads are supported in
hardware. N thread contexts exist and contain all pos-
sible registers of a single thread and all status informa-
tion required. A multithreaded VLIW processor 600 has
M functional units, where M is greater than or equal to
K. The modified pipeline stage 750, shown in FIG. 7B,
works in the following manner. In each cycle, up to N
packets (each containing up to K instructions) are
fetched at the fetch stage 760. The decode stage 770
decodes up to N*K instructions and determines their re-
quirements and the registers read and written. The al-
locate stage 780 selects M out of (up to) N*K instructions
and forwards them to the M functional units. It is as-
sumed that each functional unit can read up to 3 regis-
ters and write one register. In the execute stage 790, up
to M functional units read up to 3*M registers and write
up to M registers.

[0024] The allocate stage 780 selects for execution
the appropriate M instructions from the (up to) N * K in-
structions that were fetched and decoded at stages 760
and 770. The criteria for selection are thread priority or
resource availability or both. Under the thread priority
criteria, different threads can have different priorities.
The allocate stage 780 selects and forwards the packets
(or instructions from packets) for execution belonging to
the thread with the highest priority according to the pri-
ority policy implemented. A multitude of priority policies
can be implemented. For example, a priority policy for
a multithreaded VLIW processor supporting N contexts
(N hardware threads) can have N priority levels. The
highest priority thread in the processor is allocated be-
fore any other thread. Among threads with equal priority,
the thread that waited the longest for allocation is pre-
ferred.

[0025] Under the resource availability criteria, a pack-
et (having up to K instructions) can be allocated only if
the resources (functional units) required by the packet
are available for the next cycle. Functional units report
their availability to the allocate stage 780.
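
The two selection criteria can be combined in a short sketch. It is an illustration under simplifying assumptions rather than the disclosed hardware: the `Packet` record and `allocate_whole_packets` function are hypothetical names, all functional units are treated as interchangeable, and a larger `priority` value is assumed to mean higher priority. The sample values mirror the three-thread situation of FIG. 9, with an assumed split of three and two operations between threads TB and TC.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    thread: str    # thread name, e.g. "TA"
    priority: int  # assumed convention: larger value = higher priority
    waited: int    # cycles this thread has waited for allocation
    n_ops: int     # instructions in the packet (at most K)


def allocate_whole_packets(packets, free_units):
    """Allocate complete packets only, highest priority first.

    Ties in priority go to the thread that has waited longest.
    Returns (issued thread names, stalled thread names, units left idle).
    """
    order = sorted(packets, key=lambda p: (-p.priority, -p.waited))
    issued, stalled = [], []
    for p in order:
        if p.n_ops <= free_units:
            issued.append(p.thread)
            free_units -= p.n_ops
        else:
            stalled.append(p.thread)
    return issued, stalled, free_units


# Seven functional units; TB and TC outrank TA, which needs four units.
fig9 = [Packet("TA", 0, 0, 4), Packet("TB", 1, 0, 3), Packet("TC", 1, 0, 2)]
result = allocate_whole_packets(fig9, free_units=7)
```

With whole-packet allocation, thread TA stalls and two functional units stay idle for the cycle, which is the inefficiency that packet splitting is intended to recover.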
[0026] FIG. 8 illustrates a schematic block diagram of
an implementation of the allocate stage 780. As shown
in FIG. 8, the hardware needed to implement the allo-
cate stage 780 includes a priority encoder 810 and two
crossbar switches 820, 830. Generally, the priority en-
coder 810 examines the state of the multiple operations
in each thread, as well as the state of the available func-
tional units. The priority encoder 810 selects the packets
that are going to execute and sets up the first crossbar
switch 820 so that the appropriate register contents are
transferred to the functional units at the beginning of the
next cycle. The output of the priority encoder 810 con-
figures the first crossbar switch 820 to route data from
selected threads to the appropriate functional units. This
can be accomplished, for example, by sending the reg-
ister identifiers (that include a thread identifier) to the
functional units and letting the functional units read the
register contents via a separate data network and using
the crossbar switch 820 to move the appropriate register
contents to latches that are read by the functional units
at the beginning of the next cycle.

[0027] Out of the N packets that are fetched by the
fetch stage 760 (FIG. 7B), the priority encoder 810 se-
lects up to N packets for execution according to priority
and resource availability. In other words, the priority en-
coder selects the highest priority threads that do not re-
quest unavailable resources for execution. It then sets
up the first crossbar switch 820. The input crossbar
switch 820 routes up to 3K*N inputs to up to 3*M outputs.
The first crossbar switch 820 has the ability to transfer
the register identifiers (or the contents of the appropriate
registers) of each packet to the appropriate functional
units.

[0028] Since there are up to N threads that can be se-
lected in the same cycle and each thread can issue a
packet of up to K instructions and each instruction can
read up to 3 registers, there are 3K*N register identifiers
to select from. Since there are only M functional units
and each functional unit can accept a single instruction,
there are only 3M register identifiers to be selected.
Therefore, the crossbar switch implements a 3K*N to
3M routing of register identifiers (or register contents).
[0029] The output crossbar switch 830 routes M in-
puts to N*M or N*K outputs. The second crossbar switch
830 is set up at the appropriate time to transfer the re-
sults of the functional units back to the appropriate reg-
isters. The second crossbar switch 830 can be imple-
mented as a separate network by sending the register
identifiers (that contain a thread identifier) to the func-
tional units. When a functional unit computes a result,
the functional unit routes the result to the given register
identifier. There are M results that have to be routed to
up to N threads. Each thread can accept up to K results.
The second crossbar switch 830 routes M results to N*K
possible destinations. The second crossbar switch 830
can be implemented as M buses that are connected to
all N register files. In this case, the routing becomes M
results to N*M possible destinations (if the register files
have the ability to accept M results).
[0030] In a conventional single-threaded VLIW archi-
tecture, all operations in an instruction packet are issued
simultaneously. There are always enough functional
units available to issue a packet. When an operation
takes multiple cycles, the instruction issue logic may
stall, because there is no other source of operations
available. In a multithreaded VLIW processor in accord-
ance with the present invention, on the other hand,
these restrictions do not apply.
[0031] FIG. 9 illustrates the execution of three threads
TA-TC for a conventional multithreaded implementation
(without the benefit of the present invention), where
threads B and C have higher priority than thread A.
Since thread A runs at the lowest priority, its operations
will be the last to be assigned. As shown in FIG. 9, five
functional units 920 are assigned to implement the five
operations in the current cycle of the higher priority
threads TB and TC. Thread A has four operations, but
there are only two functional units 920 available. Thus,
thread A stalls for a conventional multithreaded imple-
mentation.

[0032] In order to maximize throughput across all
threads, and minimize the number of cycles that the
functional units rest idle, the present invention utilizes
instruction packet splitting. Instead of allocating all op-
erations in an instruction packet at one time, the alloca-
tion hardware 780, discussed above in conjunction with
FIG. 8, assigns as many operations from each packet
as will fit on the available functional units. Those oper-
ations that will not fit are retained in a ready-to-run reg-
ister 850 (FIG. 8). On subsequent cycles, instruction
packets in which all operations have been issued to
functional units are updated from their thread's instruc-
tion stream, while instruction packets with operations
that have been held are retained. The functional unit al-
location logic 780 can then assign operations from the
newly-loaded instruction packets as well as operations
that were not issued from the retained instruction pack-
ets.

[0033] The operation of instruction packet splitting in
accordance with the present invention is illustrated in
FIGS. 10A and 10B. In FIG. 10A, there are three
threads, each with an instruction packet from location n
ready to run at the start of cycle x. Thread A runs at the
lowest priority, so its operations will be the last to be as-
signed. Threads B and C require five of the seven avail-
able functional units 1020 to execute. Only two function-
al units 1020-2 and 1020-6 remain, so only the first two
operations from thread A are assigned to execute. All
seven functional units 1020 are now fully allocated.
[0034] At the completion of cycle x, the instruction
packets for threads B and C are retired. The instruction
issue logic associated with the threads replaces the in-
struction packets with those for address n+1, as shown
in FIG. 10B. Since the instruction packet for thread A is
not completed, the packet from address n is retained,
with the first two operations marked completed. On the
next cycle, x+1, illustrated in FIG. 10B, the final two op-
erations from thread A are allocated to functional units,
as well as all the operations from threads B and C. Thus,
the present invention provides greater utilization of
hardware resources (i.e., the functional units 1020) and
a lower elapsed time across a workload comprising the
multiple threads.
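
The two cycles of FIGS. 10A and 10B can be traced with a small sketch. The function name `allocate_with_splitting` is hypothetical, functional units are treated as interchangeable, and the packet sizes for threads B and C (three and two operations, at both address n and address n+1) are assumptions chosen only to match the totals given in the text.

```python
def allocate_with_splitting(pending, free_units):
    """Issue as many operations per packet as fit, highest priority first.

    pending: list of (thread, operations_remaining), already in priority order.
    Returns (issued, held): operations issued this cycle and operations
    retained for a later cycle, both keyed by thread name.
    """
    issued, held = {}, {}
    for thread, ops in pending:
        n = min(ops, free_units)
        issued[thread] = n
        free_units -= n
        if ops > n:
            held[thread] = ops - n  # stays in the ready-to-run register
    return issued, held


# Cycle x: seven units, TB and TC take five, TA issues two of its four.
issued_x, held_x = allocate_with_splitting([("TB", 3), ("TC", 2), ("TA", 4)], 7)

# Cycle x+1: TB and TC load their next packets, TA keeps its remainder.
next_pending = [("TB", 3), ("TC", 2), ("TA", held_x["TA"])]
issued_x1, held_x1 = allocate_with_splitting(next_pending, 7)
```

Under these assumptions all seven functional units are busy in both cycles, and thread A completes its packet one cycle after it started instead of stalling.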

[0035] An instruction packet cannot be split without
verifying that splitting would not violate the semantics of
the instruction packet assembled by the compiler. In partic-
ular, the input value of a register is assumed to be the
same for instructions in a packet, even if the register is
modified by one of the instructions in the packet. If the
packet is split, and a source register for one of the in-
structions in the second part of the packet is modified
by one of the instructions in the first part of the packet,
then the compiler semantics will be violated. This is il-
lustrated in the program fragment 1110 in FIG. 11.
[0036] As shown in FIG. 11, if instructions in locations
L1, L2 and L3 are assembled into an instruction packet,
and R0 = 0, R1 = 2, and R2 = 3 before the packet exe-
cutes, then the value of R0 will be 5 after the packet
completes. On the other hand, if the packet is split and
instruction L1 executes before L3, then the value of R0
will be 2 after the packet completes, violating the as-
sumptions of the compiler.
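
The effect can be reproduced with a small simulation. The three instructions below are a hypothetical fragment, chosen only because they yield the register values quoted above; they are not taken from FIG. 11, and register R4 and the helper names `run_packet` and `run_split` are likewise assumptions.

```python
def run_packet(regs, instrs):
    """Execute a packet: every instruction reads the pre-packet register values."""
    before = dict(regs)
    after = dict(regs)
    for dest, srcs, op in instrs:
        after[dest] = op(*(before[s] for s in srcs))
    return after


def run_split(regs, first, second):
    """Execute a split packet: the second part sees the first part's results."""
    return run_packet(run_packet(regs, first), second)


# Hypothetical fragment consistent with the values in the text.
L1 = ("R2", ("R0",), lambda a: a)               # R2 <- R0
L2 = ("R4", ("R1",), lambda a: a)               # R4 <- R1 (unrelated)
L3 = ("R0", ("R1", "R2"), lambda a, b: a + b)   # R0 <- R1 + R2

regs = {"R0": 0, "R1": 2, "R2": 3, "R4": 0}
whole = run_packet(regs, [L1, L2, L3])     # L3 reads the old R2 = 3
split = run_split(regs, [L1], [L2, L3])    # L3 reads the new R2 = 0
```

Issued as one packet, L3 computes R0 = 2 + 3; issued after L1 in a later cycle, it computes R0 = 2 + 0, which is the discrepancy the paragraph describes.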

[0037] One means to avoid packet splitting that would 
violate program semantics is to add hardware to the in- 
struction issue logic that would identify when destination 
registers are used as sources in other instructions in an 
instruction packet. This hardware would inhibit packet 
splitting when one of these read-after-write hazards ex- 
ists. The mechanism has the disadvantage of requiring 
additional hardware, which takes area resources and 
could possibly impact critical paths of a processor and
therefore reduce the speed of the processor. 
[0038] A compiler can easily detect read-after-write
hazards in an instruction packet. It can therefore choose
to never assemble instructions with these hazards into
an instruction packet. The hardware would then be
forced to run these instructions serially, and thereby
avoid the hazard. Any instruction packet that had a read-after-write
hazard would be considered in error, and the
architecture would not guarantee the results. Although
safe from semantic violations, this technique has the
disadvantage that it does not exploit potential parallelism
in a program, since the parallelism available in the
instruction packet with a hazard is lost, even if the underlying
hardware would not have split the packet.
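
The compiler-only policy of paragraph [0038] can be sketched as follows. This is a minimal sketch rather than an actual compiler pass: the `Instr` record, the `has_hazard` and `pack_without_hazards` helpers, the packet width and the sample instructions are all hypothetical. The hazard test is deliberately conservative, flagging any packet in which one instruction's destination register is a source of a different instruction.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str
    srcs: FrozenSet[str]


def has_hazard(packet: List[Instr]) -> bool:
    """True if any destination register is a source of another instruction."""
    return any(
        writer.dest in reader.srcs
        for writer in packet
        for reader in packet
        if reader is not writer
    )


def pack_without_hazards(instrs: List[Instr], width: int) -> List[List[Instr]]:
    """Greedily fill packets, closing one when it is full or a hazard appears."""
    packets: List[List[Instr]] = []
    current: List[Instr] = []
    for ins in instrs:
        candidate = current + [ins]
        if len(candidate) > width or has_hazard(candidate):
            packets.append(current)   # close the packet before the hazard
            current = [ins]
        else:
            current = candidate
    if current:
        packets.append(current)
    return packets


program = [
    Instr("L1", "R2", frozenset({"R0"})),
    Instr("L2", "R3", frozenset({"R1"})),
    Instr("L3", "R0", frozenset({"R1", "R2"})),  # reads R2, written by L1
]
packets = pack_without_hazards(program, width=3)
names = [[ins.name for ins in packet] for packet in packets]
```

Here the hazard between L1 and L3 forces L3 into a packet of its own, so the three instructions can no longer issue together even on hardware that would never have split the packet.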
[0039] The present invention combines compiler
knowledge with a small amount of hardware. In the illustrative
implementation, a single bit, referred to as the
split bit, is placed in the prefix of a multi-instruction packet
to inform the hardware that this packet cannot be split.
Since the compiler knows which packets have potential
read-after-write hazards, it can set this bit in the packet
prefix whenever a hazard occurs. At run-time, the hardware
will not split a packet with the bit set, but instead
will wait until all instructions in the packet can be run in
parallel. This concept is illustrated in FIGS. 12 through
15.

[0040] The compiler detects that the three instruction
sequence 1210 in FIG. 12 can be split safely, so the split
bit is set to 1, as shown in FIG. 13. In FIG. 14, on the
other hand, the three instruction sequence 1410 cannot
be split, since there is a read-after-write hazard between
instructions L1 and L3. The split bit is therefore set to 0,
as shown in FIG. 15.

[0041] It is to be understood that the embodiments
and variations shown and described herein are merely
illustrative of the principles of this invention and that various
modifications may be implemented by those skilled
in the art without departing from the scope and spirit of
the invention.



Claims

1. A multithreaded very large instruction word (VLIW)
processor, comprising:

a plurality of functional units for executing a plurality
of instructions from a multithreaded instruction
stream, said instructions being
grouped into packets by a compiler, said compiler
including an indication in said packet of
whether said instructions in said packet may be
split; and

an allocator that selects instructions from said
instruction stream and forwards said instructions
to said plurality of functional units, said allocator
assigning instructions from at least one
of said instruction packets to a plurality of said
functional units if said indication indicates said
packet may be split.

2. The multithreaded very large instruction word
(VLIW) processor of claim 1, wherein said indication
is a split bit.

3. The multithreaded very large instruction word
(VLIW) processor of claim 1, wherein said allocator
assigns as many instructions from a given instruction
packet as permitted by an availability of said
functional units.

4. The multithreaded very large instruction word
(VLIW) processor of claim 1, further comprising a
register for storing for execution in a later cycle an
indication of those instructions from a given instruction
packet that cannot be allocated to a functional
unit in a given cycle.

5. The multithreaded very large instruction word
(VLIW) processor of claim 4, wherein instruction
packets in which all instructions have been issued
to functional units are updated from the instruction
stream of said thread.

6. The multithreaded very large instruction word
(VLIW) processor of claim 4, wherein instruction
packets with instructions indicated in said register
are retained.

7. A method of processing instructions from a multithreaded
instruction stream in a multithreaded very
large instruction word (VLIW) processor, comprising
the steps of:

executing said instructions using a plurality of
functional units, said instructions being
grouped into packets by a compiler, said compiler
including an indication in said packet of
whether said instructions in said packet may be
split; and

assigning instructions from at least one of said
instruction packets to a plurality of said functional
units if said indication indicates said
packet may be split; and

forwarding said selected instructions to said
plurality of functional units.

8. The method of claim 7, wherein said indication is a
split bit.

9. The method of claim 7, wherein said assigning step
assigns as many instructions from a given instruction
packet as permitted by an availability of said
functional units.

10. The method of claim 7, further comprising the step
of storing for execution in a later cycle an indication
of those instructions from a given instruction packet
that cannot be allocated to a functional unit in a given
cycle.

11. The method of claim 10, wherein instruction packets
in which all instructions have been issued to
functional units are updated from the instruction
stream of said thread.

12. The method of claim 10, wherein instruction packets
with instructions indicated in said register are
retained.

13. An article of manufacture for processing instructions
from an instruction stream having a plurality
of threads in a multithreaded very large instruction
word (VLIW) processor, comprising:

a computer readable medium having computer
readable program code means embodied thereon,
said computer readable program code means
comprising program code means for causing a computer
to:

execute said instructions using a plurality of
functional units, said instructions being
grouped into packets by a compiler, said compiler
including an indication in said packet of
whether said instructions in said packet may be
split; and

assign instructions from at least one of said instruction
packets to a plurality of said functional
units if said indication indicates said packet
may be split; and

forward said selected instructions to said plurality
of functional units.

14. A compiler for a multithreaded very large instruction
word (VLIW) processor, comprising:

a memory for storing computer-readable code;
and

a processor operatively coupled to said memory,
said processor configured to:

translate instructions from a program into
a machine language;

group a plurality of said instructions into a
packet; and

provide an indication with said packet indicating
whether said instructions in said
packet may be split.

15. The compiler of claim 14, wherein said instruction
packet can be split provided the semantics of the
instruction packet assembled by the compiler are
not violated.

16. The compiler of claim 14, wherein said instruction
packet can be split provided a source register for
one of the instructions in a first part of said packet
is not modified by one of the instructions in a second
part of said packet.

17. An article of manufacture for compiling instructions
from an instruction stream having a plurality of
threads for use in a multithreaded very large instruction
word (VLIW) processor, comprising:

a computer readable medium having computer
readable program code means embodied thereon,
said computer readable program code means
comprising program code means for causing a computer
to:



translate instructions from a program into a ma- 
chine language; 

group a plurality of said instructions into a pack- 
et; and 

provide an indication with said packet indicat- 
ing whether said instructions in said packet may 
be split. 
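
As a closing illustration, the split indication recited in the claims can be pictured with a minimal sketch covering both the compiler side and the issue side. Everything named here is an assumption made for illustration: the `Instr` and `Packet` records, the `build_packet` and `issue` helpers, the conservative hazard test, and the sample instructions. A boolean field stands in for the indication; its encoding in the packet prefix is not modelled.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str
    srcs: FrozenSet[str]


@dataclass
class Packet:
    instrs: List[Instr]   # instructions still waiting to issue
    may_split: bool       # indication provided by the compiler


def build_packet(instrs: List[Instr]) -> Packet:
    """Compiler side: group instructions and attach the split indication."""
    hazard = any(
        w.dest in r.srcs
        for w in instrs
        for r in instrs
        if r is not w
    )
    return Packet(instrs=list(instrs), may_split=not hazard)


def issue(packet: Packet, free_units: int) -> List[str]:
    """Issue side: issue what fits, but never split a protected packet."""
    if not packet.may_split and len(packet.instrs) > free_units:
        return []                      # wait for a cycle with enough units
    count = min(len(packet.instrs), free_units)
    issued = packet.instrs[:count]
    packet.instrs = packet.instrs[count:]
    return [ins.name for ins in issued]


safe = build_packet([
    Instr("L1", "R3", frozenset({"R0"})),
    Instr("L2", "R4", frozenset({"R1"})),
    Instr("L3", "R5", frozenset({"R2"})),
])
unsafe = build_packet([
    Instr("L1", "R2", frozenset({"R0"})),
    Instr("L2", "R3", frozenset({"R1"})),
    Instr("L3", "R0", frozenset({"R1", "R2"})),
])

safe_first = issue(safe, free_units=2)      # partial issue is allowed
unsafe_first = issue(unsafe, free_units=2)  # held back as a whole
unsafe_later = issue(unsafe, free_units=3)  # issues once all three fit
```

With two free units the hazard-free packet issues two instructions and retains the third, while the packet carrying the hazard waits until three units are available and then issues in full.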